Overview

Assignment: This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations require us to understand our manufacturing process and its predictive factors, and to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both RPubs links and .Rmd files (or other readable formats) for the technical and non-technical reports. Also submit the Excel file showing your models' pH predictions.

1. Data Exploration

1.1 Load data

#Load train data
library(curl)    #for curl_download()
library(readxl)  #for read_excel()
#URL of the raw .xlsx file
url <- "https://raw.githubusercontent.com/ex-pr/project_2_DATA624/main/StudentData.xlsx"
#Temporary path
temp_file <- tempfile(fileext = ".xlsx")
#Download file
curl_download(url, temp_file)
#Read file
data_beverage <- read_excel(temp_file)
#Copy data
beverage_df <- data_beverage
#Clean temp file
unlink(temp_file)

#Load test data
#URL of the raw .xlsx file
url <- "https://raw.githubusercontent.com/ex-pr/project_2_DATA624/main/StudentEvaluation.xlsx"
#Temporary path
temp_file <- tempfile(fileext = ".xlsx")
#Download file
curl_download(url, temp_file)
#Read file
data_eval <- read_excel(temp_file)
#Copy data
eval_df <- data_eval
#Clean temp file
unlink(temp_file)

1.2 Summary Statistics

Beverage dataset: 2571 rows x 33 columns.
Evaluation dataset: 267 rows x 33 columns.
Mostly numeric, with one categorical variable (Brand Code).

  • NAs:
    Brand Code: ~4.7% missing in beverage_df, ~3.0% missing in eval_df.
    Many numeric features have a small percentage of missing values, generally < 2%.
    PH (the target variable): 4 NAs in beverage_df; in eval_df all PH values are missing (it is the prediction set).
    Impute missing numeric values using the median or mean (or more advanced imputation if correlated features exist), handling them within a robust pipeline to avoid information loss.
    Brand Code can be imputed or left as-is, depending on the percentage missing per brand.

  • Skewness and outliers
    Variables like Mnf Flow have extreme values (mean is heavily influenced by outliers). Range: -100.2 to 229.4.
    PH: Check for extreme pH values that may affect model accuracy.
    Hyd Pressure1–4: High standard deviations (e.g., Hyd Pressure2 with a mean of ~21 and SD of 16.4).
    Analyze the distribution of these variables using histograms or boxplots; consider winsorization or a Box-Cox/log transformation for skewed distributions.

  • Feature importance for PH prediction
    Carb Volume, Fill Ounces, PC Volume, Carb Pressure, and Carb Temp have small standard deviations and are likely controlled manufacturing parameters. These might directly influence pH.
    Brand Code can be treated as a categorical predictor for brand-specific variations.
    Use correlation or feature-importance analysis to identify which variables most strongly influence PH.

  • Brand Code: 4 levels (A, B, C, D).
    Unbalanced distribution: B accounts for ~50% of records, while A, C, and D are much smaller. The imbalance might affect models like decision trees or ensemble methods.
    Apply stratified sampling or weighting to handle imbalance during training. Explore interaction effects between Brand Code and numeric variables.

  • Multicollinearity
    Variables such as Carb Volume, PC Volume, and Carb Pressure might be correlated due to their role in carbonation.
    Multiple pressure-related variables (Carb Pressure1, Fill Pressure, etc.) and filler speed/level metrics could also have collinear relationships.
    Compute a correlation matrix to detect highly correlated predictors.
    Use Principal Component Analysis (PCA) or the Variance Inflation Factor (VIF) to handle multicollinearity.

  • Data Leakage
    Variables like PSC, PSC Fill, and PSC CO2 could be downstream measures dependent on pH. Confirm whether these are part of the production process or outcome metrics.
    Analyze the production process to ensure no data leakage into the model.

  • eval_df
    All PH values are missing (to be predicted); the column can be dropped during preprocessing and re-attached for the prediction output.
    Structure and missingness are similar to beverage_df. Ensure preprocessing and feature engineering pipelines are consistent between training and evaluation datasets.

  • Feature engineering
    Ratios: Carb Volume / PC Volume (efficiency metrics)
    Differences: Carb Pressure - Fill Pressure (pressure loss)
    One-hot encoding for Brand Code
    Binning continuous variables (e.g., temperature ranges)
    There are no timestamps, so seasonality or shift-based features are not needed.

  • Modeling Considerations
    Use feature scaling: variables like Carb Flow and Filler Speed have very different ranges and should be normalized or scaled for models like SVM or neural networks.
    If PH is skewed, consider a log or Box-Cox transform for models sensitive to its distribution (e.g., linear regression).
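The ratio, difference, and encoding ideas above can be sketched in a few lines of dplyr. This is illustrative only; the engineered column names carb_efficiency and pressure_loss are our own.

```r
library(dplyr)

# Illustrative engineered features; na_if() guards the ratio against division by zero
beverage_fe <- beverage_df %>%
  mutate(
    carb_efficiency = `Carb Volume` / na_if(`PC Volume`, 0),  # efficiency metric
    pressure_loss   = `Carb Pressure` - `Fill Pressure`       # pressure loss
  )

# One-hot encoding for Brand Code (model.matrix drops rows with NA Brand Code)
brand_dummies <- model.matrix(~ `Brand Code` - 1, data = beverage_df)
```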

#Check first rows of data
DT::datatable(
      beverage_df[1:10,],
      options = list(scrollX = TRUE,
                     deferRender = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     info = FALSE,  
                     paging=FALSE,
                     searching = FALSE), 
      rownames = FALSE,
      caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
    'Table 1: First 10 Rows of Beverage Data'
  )) 
DT::datatable(
      eval_df[1:10,],
      options = list(scrollX = TRUE,
                     deferRender = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     info = FALSE,      
                     paging=FALSE,
                     searching = FALSE), 
      rownames = FALSE,
      caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
    'Table 2: First 10 Rows of Evaluation Data'
  )) 
#Summary statistics
DT::datatable(
      dfSummary(beverage_df, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE),
      extensions = c('Scroller'),
      options = list(scrollY = 350,
                     scrollX = 500,
                     deferRender = TRUE,
                     scroller = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE), 
      rownames = FALSE,
      caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
    'Table 3: Summary Statistics for Beverage Data'
  )) 
#Summary statistics
DT::datatable(
      dfSummary(eval_df, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE),
      extensions = c('Scroller'),
      options = list(scrollY = 350,
                     scrollX = 500,
                     deferRender = TRUE,
                     scroller = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE), 
      rownames = FALSE,
      caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
    'Table 4: Summary Statistics for Evaluation Data'
  )) 
stat <- beverage_df %>%
  group_by(`Brand Code`) %>%
  filter(!is.na(`Brand Code`)) %>% 
  dplyr::summarize(
    Min = min(PH, na.rm = TRUE),
    Q1 = quantile(PH, 0.25, na.rm = TRUE),
    Median = median(PH, na.rm = TRUE),
    Mean = mean(PH, na.rm = TRUE),
    Q3 = quantile(PH, 0.75, na.rm = TRUE),
    Max = max(PH, na.rm = TRUE),
    StdDev = sd(PH, na.rm = TRUE),
    Count = n(),
    Missing = sum(is.na(PH)) # Add the count of NA values
  )

#Summary statistics by code
DT::datatable(
      stat,
      options = list(dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE,
                     paging=FALSE,
                     info = FALSE), 
      rownames = FALSE,
      caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
    'Table 5: Summary Statistics for Each Brand Code'
  )) %>%
  DT::formatRound(columns = c("Min", "Q1", "Median", "Mean", "Q3", "Max", "StdDev"), digits = 3)

1.3 EDA

The distribution of PH is roughly normal, centered around 8.5, with some outliers in both tails, so an approximately Gaussian assumption is reasonable.
Brand B has significantly more entries than the others; consider balancing or stratified sampling during model training.
PH distribution across Brand Codes:
Brand C has the lowest median PH, while Brand B has the highest.
Outliers are present in all brand codes. Investigate whether they are measurement errors or valid extreme cases.

ggplot(beverage_df, aes(x = PH)) +
  geom_histogram(binwidth = 0.1, fill = "lightgreen", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of PH", x = "PH", y = "Frequency")

ggplot(beverage_df, aes(x = `Brand Code`)) +
  geom_bar(fill = "lightgreen") +
  theme_minimal() +
  labs(title = "Count of Entries by Brand Code", x = "Brand Code", y = "Count")

ggplot(beverage_df, aes(x = `Brand Code`, y = PH, fill = `Brand Code`)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Boxplot of PH by Brand Code", x = "Brand Code", y = "PH")

#ggplot(beverage_df, aes(x = `Carb Pressure`, y = PH)) +
#  geom_point(alpha = 0.6) +
#  theme_minimal() +
#  labs(title = "Scatterplot: Carb Pressure vs PH", x = "Carb Pressure", y = "PH")

MFR, Filler Speed, Hyd Pressure3, and Pressure Setpoint have outliers.
PC Volume and PSC CO2 show categorical-like distributions.
Mnf Flow has a long-tailed distribution.
Filler Speed, Hyd Pressure2, and Usage cont have sharp peaks.
Hyd Pressure4 has a wide spread, offering insight into process variability.
MFR and Oxygen Filler are heavily right-skewed.
Consider a log or Box-Cox transformation for variables like MFR and Usage cont to normalize them.
Group categorical-like numeric variables (e.g., PSC CO2, Carb Volume) to improve interpretability.
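As a sketch of the suggested transformation, caret's BoxCoxTrans can estimate the lambda and normalize a skewed variable. This assumes MFR is strictly positive, which Box-Cox requires.

```r
library(caret)

# Estimate a Box-Cox lambda for a right-skewed variable (values must be > 0)
mfr <- na.omit(beverage_df$MFR)
bc  <- BoxCoxTrans(mfr)      # lambda estimated from the data
mfr_bc <- predict(bc, mfr)   # apply the transformation

# Compare raw vs transformed distributions
par(mfrow = c(1, 2))
hist(mfr, main = "MFR (raw)", xlab = "MFR")
hist(mfr_bc, main = "MFR (Box-Cox)", xlab = "Transformed MFR")
```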

numeric_vars <- beverage_df %>% 
  dplyr::select(where(is.numeric)) %>% 
  names()

#3 groups
group_1 <- numeric_vars[1:10]
group_2 <- numeric_vars[11:20]
group_3 <- numeric_vars[21:32]

#Group 1 plot
beverage_df %>%
  dplyr::select(all_of(group_1)) %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(value)) +
  geom_histogram(binwidth = 0.5, fill = "lightgreen", color = "black") +
  facet_wrap(~name, scales = "free", ncol = 5) +
  theme_minimal() +
  labs(title = "Distribution of Variables: Group 1")

# Group 2 Plot
beverage_df %>%
  dplyr::select(all_of(group_2)) %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(value)) +
  geom_histogram(binwidth = 0.5, fill = "lightgreen", color = "black") +
  facet_wrap(~name, scales = "free", ncol = 5) +
  theme_minimal() +
  labs(title = "Distribution of Variables: Group 2")

# Group 3 Plot
beverage_df %>%
  dplyr::select(all_of(group_3)) %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(value)) +
  geom_histogram(binwidth = 0.5, fill = "lightgreen", color = "black") +
  facet_wrap(~name, scales = "free", ncol = 5) +
  theme_minimal() +
  labs(title = "Distribution of Variables: Group 3")

The first two principal components explain approximately 20.16% and 18.74% of the variance, suggesting room for dimensionality reduction.
Variables close to each other on the biplot (e.g., Balling, Balling Lvl, Density, and Carb Rel) are strongly correlated. Dim1 is driven by Balling, Balling Lvl, Density, and Carb Rel; Dim2 by Hyd Pressure3, Carb Pressure1, Filler Speed, and PH. These are good candidates for modeling.
Filler Level, Bowl Setpoint, and PH group together, so they may jointly influence the PH outcome.
Mnf Flow and Hyd Pressure3 are distant from the other variables, indicating unique variance.
Variables near the origin contribute little and could be dropped.
Combine strongly correlated variables (e.g., Carb Rel, Carb Flow, Density) or reduce them using PCA components, clustering, or factor analysis.

#PCA
pca <- PCA(beverage_df %>% dplyr::select(where(is.numeric)) %>% na.omit(), scale.unit = TRUE)

MFR has the highest percentage of missing values, around 7%; it is either difficult to collect or not consistently relevant across rows.
Use mean, median, or model-based imputation if MFR is critical; if it does not strongly correlate with the target or other features, consider dropping it.
For variables with <3% missingness (PSC CO2, PC Volume, etc.), simple imputation techniques suffice.
For Brand Code, use mode imputation.
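A minimal sketch of these simple strategies (median for numeric columns, mode for Brand Code); the kNN imputation applied in Section 2 is the more robust alternative. The helper mode_val and the result name beverage_imp are our own.

```r
library(dplyr)

# Mode helper for the categorical Brand Code
mode_val <- function(x) names(which.max(table(x)))

beverage_imp <- beverage_df %>%
  # median-impute every numeric column
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))) %>%
  # mode-impute Brand Code
  mutate(`Brand Code` = ifelse(is.na(`Brand Code`),
                               mode_val(`Brand Code`),
                               `Brand Code`))
```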

#NA
gg_miss_var(beverage_df, show_pct = TRUE)

#Outliers
boxplot(beverage_df$PH, main = "Boxplot of PH", horizontal = TRUE)

numeric_vars <- beverage_df %>%
  dplyr::select(where(is.numeric)) %>%
  dplyr::select(-PH) %>%
  names()

for (var in numeric_vars) {
  p <- ggplot(beverage_df, aes(x = .data[[var]], y = PH)) +
    geom_point(alpha = 0.6, color = "blue") +
    theme_minimal() +
    labs(title = paste("Scatterplot:", var, "vs PH"), x = var, y = "PH")
  
  print(p)  # Display the plot
}

1.4 Correlation Matrix

Bowl Setpoint (0.36) and Filler Level (0.35) show the strongest positive correlations with pH. These features might be significant predictors in any pH prediction model.
Carb Flow (0.23) and Pressure Vacuum (0.22) also have moderate positive correlations with pH.
Additional variables such as Carb Rel (0.20) and Alch Rel (0.17) show smaller but noteworthy correlations.
Usage Cont (-0.36) and Mnf Flow (-0.46) have strong negative correlations with pH, indicating these might inversely impact pH levels.
Other variables like Pressure Setpoint (-0.31) and Fill Pressure (-0.32) also negatively correlate with pH.
Many variables such as Carb Volume, PSC, and PSC Fill have very weak correlations (close to 0), suggesting they might not be influential in predicting pH.

Density, Balling, and Carb Rel show strong intercorrelations among themselves, which could cause multicollinearity issues in linear regression models; consider PCA or feature selection.

Mnf Flow and Usage cont have strong negative correlations with pH, which may call for additional preprocessing or transformation.

Interaction terms between Bowl Setpoint and Filler Level or between Pressure Vacuum and Carb Flow could also be explored, given their individual relationships with pH.
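These candidate interactions can be screened quickly with a small linear model (illustrative only; int_fit is our own name, and variable choice follows the correlations reported here):

```r
# Screen the candidate interaction terms against PH
int_fit <- lm(PH ~ `Bowl Setpoint` * `Filler Level` +
                   `Pressure Vacuum` * `Carb Flow`,
              data = beverage_df)
summary(int_fit)  # inspect the significance of the interaction coefficients
```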

tst <- beverage_df %>% 
  dplyr::select(where(is.numeric))
#Correlation with PH
cor_table <- cor(drop_na(tst))[, "PH"] %>%
  as.data.frame() %>%
  rownames_to_column(var = "Variable") %>%
  rename(Coefficient = ".") %>%
  arrange(desc(Coefficient))

kable(cor_table, "html", escape = F, col.names = c('Variable', 'Coefficient')) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T) %>%
  add_header_above(c("Table: Correlation with PH" = 2))
Table: Correlation with PH
Variable Coefficient
PH 1.0000000
Bowl Setpoint 0.3615875
Filler Level 0.3520440
Carb Flow 0.2335937
Pressure Vacuum 0.2197355
Carb Rel 0.1960515
Alch Rel 0.1666822
Oxygen Filler 0.1644854
Balling Lvl 0.1093712
PC Volume 0.0988667
Density 0.0955469
Balling 0.0767002
Carb Pressure 0.0762134
Carb Volume 0.0721325
Carb Temp 0.0322794
Air Pressurer -0.0079972
PSC Fill -0.0238098
Filler Speed -0.0408830
MFR -0.0451965
Hyd Pressure1 -0.0470664
PSC -0.0698730
PSC CO2 -0.0852599
Fill Ounces -0.1183359
Carb Pressure1 -0.1187642
Hyd Pressure4 -0.1714340
Temperature -0.1826596
Hyd Pressure2 -0.2226600
Hyd Pressure3 -0.2681018
Pressure Setpoint -0.3116639
Fill Pressure -0.3165145
Usage cont -0.3576120
Mnf Flow -0.4592313
#Corr matrix
rcore <- rcorr(as.matrix(tst))
coeff <- rcore$r
#Filter to include only |r| > 0.1, the rest are 0
filtered_coeff <- coeff
filtered_coeff[abs(filtered_coeff) < 0.1] <- 0
corrplot(filtered_coeff, tl.cex = .6, tl.col="black", method = 'color', addCoef.col = "black",
         type="upper", order="hclust", number.cex=0.45, diag=FALSE)

#Heatmap maybe
#numeric_cols <- beverage_df %>%
#  select(where(is.numeric)) %>%
#  na.omit()

#correlation_matrix <- cor(numeric_cols)
#corrplot::corrplot(correlation_matrix, method = "color", type = "upper", diag = FALSE)

2. Data Transformation

clean_bev <- beverage_df

nearzero <- nearZeroVar(clean_bev, saveMetrics= TRUE)
nearzero[nearzero$nzv,][1:5,] %>% drop_na()
##               freqRatio percentUnique zeroVar  nzv
## Hyd Pressure1  31.11111      9.529366   FALSE TRUE

MFR shows ~7% missing values, so we remove this feature. Hyd Pressure1 shows near-zero variance, so we remove it as well.

# Remove 2 fields
cleantrain <- clean_bev %>%
  dplyr::select(-c(MFR, `Hyd Pressure1`))

# remove the fields from our evaluation data
cleaneval <- eval_df %>%
  dplyr::select(-c(MFR, `Hyd Pressure1`))

cleaneval<- cleaneval%>%
  dplyr::select(-PH)
set.seed(100)

# drop rows with missing PH
cleantrain <- cleantrain  %>%
  filter(!is.na(PH))

# Change Brand Code missing to 'Unknown' in our training data
brandcode <- cleantrain %>%
  dplyr::select(`Brand Code`) %>%
  replace_na(list(`Brand Code` = 'Unknown'))

cleantrain$`Brand Code` <- brandcode$`Brand Code`

# Change Brand Code missing to 'Unknown' 
brandcodeev <- cleaneval  %>%
  dplyr::select(`Brand Code`) %>%
  replace_na(list(`Brand Code` = 'Unknown'))

cleaneval$`Brand Code` <- brandcodeev$`Brand Code`

# change an unseen data as well
codes <- unique(cleantrain$`Brand Code`)

cleaneval <- cleaneval %>%
  mutate(`Brand Code`  = if_else(`Brand Code` %in% codes, `Brand Code`, 'Unknown'))

# Use the kNN imputing method from VIM package to impute missing values in our training data
cleantrain <- cleantrain %>% 
  kNN(k=10) %>%
  dplyr::select(colnames(cleantrain))

# Use the kNN imputing method from the VIM package to impute missing values in our evaluation data
cleaneval<- cleaneval %>%
  kNN(k=10) %>%
  dplyr::select(colnames(cleaneval))

Feature Engineering

Add dummy variables for Brand Code

cleantrain_dummy <- dummyVars(PH ~ `Brand Code`, data = cleantrain)
dummies <- predict(cleantrain_dummy, cleantrain)


dummy_cols <- sort(colnames(dummies))

# columns are sorted 
dummies <- as_tibble(dummies) %>%
  dplyr::select(all_of(dummy_cols))

# remove the original categorical feature
cleantrain <- cleantrain %>%
  dplyr::select(-`Brand Code`)

# add the new dummy columns to  train
cleantrain <- cbind(dummies, cleantrain)

# -----
# Evaluation data - Convert our `Brand Code` column into a set of dummy variables

cleaneval$PH <- 1  # placeholder so the dummyVars formula can be applied
eval_dummies <- as.data.frame(predict(cleantrain_dummy, cleaneval))

# if a dummy column was not produced for the evaluation data, add it as all zeros
for (c in dummy_cols) {
  if (!(c %in% colnames(eval_dummies))) {
    eval_dummies[[c]] <- 0
  }
}


dummy_cols <- sort(colnames(eval_dummies))
eval_dummies <- as_tibble(eval_dummies) %>%
  dplyr::select(all_of(dummy_cols))

# remove the original categorical feature
cleaneval <- cleaneval %>%
  dplyr::select(-c(`Brand Code`, PH))

# add the new dummy columns to our main eval dataframe
cleaneval <- cbind(eval_dummies, cleaneval)
library(caret)
preProcValues2 <- preProcess(cleantrain, method = "BoxCox")
train_bc <- predict(preProcValues2, cleantrain)

# apply the SAME training transformation to the evaluation data when needed:
# test_bc <- predict(preProcValues2, cleaneval)

#Prepare data for ggplot
gatherdf <- train_bc %>% 
  dplyr::select(-c(PH)) %>%
  gather(key = 'variable', value = 'value')

# Histogram plots of each variable
ggplot(gatherdf) + 
  geom_histogram(aes(x=value, y = after_stat(density)), bins=50) + 
  geom_density(aes(x=value), color='blue') +
  facet_wrap(. ~variable, scales='free', ncol=5)

Build Models

  • Start with linear regression after transformation and dimensionality reduction.
  • Random Forest or Gradient Boosting handle multicollinearity well, so no dimensionality reduction is needed, only informed variable selection. SVM is another nonlinear candidate.
  • Regularized regression models (ridge, elastic net) shrink redundant features to reduce noise.
set.seed(100)

# utilizing one dataset for all four models
set <- createDataPartition(train_bc$PH, p=0.7, list=FALSE)
train <- train_bc[set,]
test <- train_bc[-set,]

Model 1 - Linear Model on Train data

set.seed(100)

#install.packages("doParallel")
library(doParallel)
library(MASS)

cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(200)


model1_full_lm <- lm(PH ~ ., data = train)

# model1_full_lm  - (using stepAIC)
model1 <- stepAIC(model1_full_lm , direction = "both",
                         scope = list(upper = model1_full_lm , lower = ~ 1),
                         scale = 0, trace = FALSE)

stopCluster(cl)

# Display Model 1 Summary
(lms <- summary(model1))
## 
## Call:
## lm(formula = PH ~ `\`Brand Code\`B` + `\`Brand Code\`C` + `\`Brand Code\`D` + 
##     `Carb Volume` + `Fill Ounces` + `PC Volume` + `Carb Temp` + 
##     `PSC CO2` + `Mnf Flow` + `Carb Pressure1` + `Hyd Pressure2` + 
##     `Hyd Pressure3` + `Filler Level` + Temperature + `Usage cont` + 
##     `Carb Flow` + Density + Balling + `Pressure Vacuum` + `Oxygen Filler` + 
##     `Bowl Setpoint` + `Pressure Setpoint` + `Air Pressurer` + 
##     `Carb Rel` + `Balling Lvl`, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6481 -0.6695  0.0674  0.7153  4.0027 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          8.172e+04  3.515e+04   2.325 0.020186 *  
## `\\`Brand Code\\`B`  5.062e-01  1.185e-01   4.273 2.03e-05 ***
## `\\`Brand Code\\`C` -6.651e-01  1.270e-01  -5.237 1.82e-07 ***
## `\\`Brand Code\\`D`  5.351e-01  1.064e-01   5.030 5.41e-07 ***
## `Carb Volume`       -1.231e+02  7.335e+01  -1.679 0.093386 .  
## `Fill Ounces`       -2.481e-02  1.308e-02  -1.896 0.058153 .  
## `PC Volume`         -6.680e-01  3.433e-01  -1.945 0.051875 .  
## `Carb Temp`          3.954e+03  2.584e+03   1.530 0.126112    
## `PSC CO2`           -1.316e+00  6.129e-01  -2.147 0.031956 *  
## `Mnf Flow`          -6.295e-03  4.694e-04 -13.411  < 2e-16 ***
## `Carb Pressure1`     3.849e-01  4.561e-02   8.440  < 2e-16 ***
## `Hyd Pressure2`     -1.432e-02  4.580e-03  -3.127 0.001795 ** 
## `Hyd Pressure3`      3.564e-02  5.596e-03   6.369 2.42e-10 ***
## `Filler Level`      -9.309e-05  4.685e-05  -1.987 0.047094 *  
## Temperature         -4.777e+04  6.851e+03  -6.973 4.35e-12 ***
## `Usage cont`        -3.253e-03  5.511e-04  -5.902 4.29e-09 ***
## `Carb Flow`          1.186e-05  3.473e-06   3.415 0.000652 ***
## Density             -4.178e-01  2.615e-01  -1.598 0.110321    
## Balling             -1.967e+00  3.011e-01  -6.532 8.46e-11 ***
## `Pressure Vacuum`   -2.065e-01  6.600e-02  -3.128 0.001787 ** 
## `Oxygen Filler`     -2.204e-01  7.211e-02  -3.056 0.002276 ** 
## `Bowl Setpoint`      2.648e-04  4.986e-05   5.310 1.24e-07 ***
## `Pressure Setpoint` -5.927e+03  1.711e+03  -3.465 0.000543 ***
## `Air Pressurer`     -1.147e+05  7.036e+04  -1.630 0.103310    
## `Carb Rel`           1.454e+02  7.520e+01   1.933 0.053337 .  
## `Balling Lvl`        9.780e-01  1.404e-01   6.965 4.60e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.098 on 1773 degrees of freedom
## Multiple R-squared:  0.444,  Adjusted R-squared:  0.4362 
## F-statistic: 56.64 on 25 and 1773 DF,  p-value: < 2.2e-16
confint(model1)
##                             2.5 %        97.5 %
## (Intercept)          1.278301e+04  1.506653e+05
## `\\`Brand Code\\`B`  2.739016e-01  7.385804e-01
## `\\`Brand Code\\`C` -9.141735e-01 -4.160419e-01
## `\\`Brand Code\\`D`  3.264456e-01  7.437564e-01
## `Carb Volume`       -2.670009e+02  2.072922e+01
## `Fill Ounces`       -5.046880e-02  8.575203e-04
## `PC Volume`         -1.341335e+00  5.433517e-03
## `Carb Temp`         -1.113581e+03  9.022066e+03
## `PSC CO2`           -2.517577e+00 -1.135914e-01
## `Mnf Flow`          -7.216087e-03 -5.374781e-03
## `Carb Pressure1`     2.954619e-01  4.743594e-01
## `Hyd Pressure2`     -2.330666e-02 -5.339455e-03
## `Hyd Pressure3`      2.466589e-02  4.661792e-02
## `Filler Level`      -1.849890e-04 -1.196553e-06
## Temperature         -6.121057e+04 -3.433703e+04
## `Usage cont`        -4.333408e-03 -2.171747e-03
## `Carb Flow`          5.049152e-06  1.867102e-05
## Density             -9.307547e-01  9.513230e-02
## Balling             -2.557642e+00 -1.376377e+00
## `Pressure Vacuum`   -3.359063e-01 -7.702190e-02
## `Oxygen Filler`     -3.618069e-01 -7.894796e-02
## `Bowl Setpoint`      1.669546e-04  3.625515e-04
## `Pressure Setpoint` -9.282345e+03 -2.572205e+03
## `Air Pressurer`     -2.526640e+05  2.331970e+04
## `Carb Rel`          -2.092798e+00  2.928850e+02
## `Balling Lvl`        7.025916e-01  1.253336e+00
set.seed(100)

library(car)
plot(model1)

set.seed(100)
#install.packages("regclass")
library(regclass)

VIF(model1)
## `\\`Brand Code\\`B` `\\`Brand Code\\`C` `\\`Brand Code\\`D`       `Carb Volume` 
##            5.226957            2.541740            3.100344            3.811709 
##       `Fill Ounces`         `PC Volume`         `Carb Temp`           `PSC CO2` 
##            1.130737            1.405911            1.075614            1.024303 
##          `Mnf Flow`    `Carb Pressure1`     `Hyd Pressure2`     `Hyd Pressure3` 
##            4.714587            1.500022            8.393839           11.848889 
##      `Filler Level`         Temperature        `Usage cont`         `Carb Flow` 
##            8.484869            1.385065            1.661131            1.799201 
##             Density             Balling   `Pressure Vacuum`     `Oxygen Filler` 
##           10.203170           22.870040            2.144546            2.016515 
##     `Bowl Setpoint` `Pressure Setpoint`     `Air Pressurer`          `Carb Rel` 
##            8.908272            1.535529            1.205666            5.203412 
##       `Balling Lvl` 
##           22.142918

Model 1 - Linear Model on Test data

set.seed(100)

# Predict df_test and calculate performance
model1predict <- predict(model1, test)

# Merge the results into a data frame called results 
results <- data.frame()
results <- data.frame(t(postResample(pred = model1predict, obs = test$PH))) %>% 
  mutate(Model = "Multiple Regression") %>% 
  rbind(results)

knitr::kable(results)
RMSE Rsquared MAE Model
1.177533 0.3760906 0.9043881 Multiple Regression

Ridge Regression

#install.packages("glmnet")
library(glmnet)
library(doParallel)
cl2 <- makePSOCKcluster(5)
registerDoParallel(cl2)
set.seed(200)

# to find the right lambda using cv.glmnet
xtrain <- model.matrix(PH ~ ., data = train)
xtest <- model.matrix(PH ~ ., data = test)

cv.glmnet <- cv.glmnet(xtrain, train$PH, alpha = 0)

ridgemdl <- glmnet(xtrain, train$PH, alpha = 0, lambda = cv.glmnet$lambda.min)

stopCluster(cl2)

summary(ridgemdl)
##           Length Class     Mode   
## a0         1     -none-    numeric
## beta      35     dgCMatrix S4     
## df         1     -none-    numeric
## dim        2     -none-    numeric
## lambda     1     -none-    numeric
## dev.ratio  1     -none-    numeric
## nulldev    1     -none-    numeric
## npasses    1     -none-    numeric
## jerr       1     -none-    numeric
## offset     1     -none-    logical
## call       5     -none-    call   
## nobs       1     -none-    numeric
# Predict df_test and calculate performance
ridge <- predict(ridgemdl, xtest)

results <- data.frame(t(postResample(pred = ridge, obs = test$PH))) %>% 
    dplyr::mutate(Model = "Ridge Regression") %>% rbind(results)

knitr::kable(results)
RMSE Rsquared MAE Model
1.174503 0.3779246 0.9089871 Ridge Regression
1.177533 0.3760906 0.9043881 Multiple Regression

Elastic Net

cl4 <- makePSOCKcluster(5)
registerDoParallel(cl4)
set.seed(200)

# training the elastic net regression model using train
enet_model <- train(
                        PH ~ ., data = train, method = "glmnet",
                        trControl = trainControl("repeatedcv", repeats = 8),
                        tuneLength = 4
)

stopCluster(cl4)

summary(enet_model)
##             Length Class      Mode     
## a0            97   -none-     numeric  
## beta        3298   dgCMatrix  S4       
## df            97   -none-     numeric  
## dim            2   -none-     numeric  
## lambda        97   -none-     numeric  
## dev.ratio     97   -none-     numeric  
## nulldev        1   -none-     numeric  
## npasses        1   -none-     numeric  
## jerr           1   -none-     numeric  
## offset         1   -none-     logical  
## call           5   -none-     call     
## nobs           1   -none-     numeric  
## lambdaOpt      1   -none-     numeric  
## xNames        34   -none-     character
## problemType    1   -none-     character
## tuneValue      2   data.frame list     
## obsLevels      1   -none-     logical  
## param          0   -none-     list
plot(enet_model)

# Make predictions
enet_pred <- predict(enet_model, test)


results <- data.frame(t(postResample(pred=enet_pred, obs=test$PH))) %>% 
    mutate(Model = "ElasticNet Regression") %>% rbind(results)

knitr::kable(results)
RMSE Rsquared MAE Model
1.172259 0.3812287 0.9038890 ElasticNet Regression
1.174503 0.3779246 0.9089871 Ridge Regression
1.177533 0.3760906 0.9043881 Multiple Regression

Neural Net (avNNET)

cl5 <- makePSOCKcluster(5)
registerDoParallel(cl5)
set.seed(200)

# avvnet
nnetGrid <- expand.grid(.decay = c(0.1, 0.5), .size = c(1,10), .bag = FALSE)

nnet.mdl <- train(PH ~ ., data = train, method = "avNNet", preProcess = c("center", 
    "scale"), tuneGrid = nnetGrid, trControl = trainControl(method = "repeatedcv", 
    repeats = 1), trace = FALSE, linout = TRUE, maxit = 500)

stopCluster(cl5)

summary(nnet.mdl)
##             Length Class      Mode     
## model        5     -none-     list     
## repeats      1     -none-     numeric  
## bag          1     -none-     logical  
## seeds        5     -none-     numeric  
## names       34     -none-     character
## terms        3     terms      call     
## coefnames   34     -none-     character
## xlevels      0     -none-     list     
## xNames      34     -none-     character
## problemType  1     -none-     character
## tuneValue    3     data.frame list     
## obsLevels    1     -none-     logical  
## param        3     -none-     list
plot(nnet.mdl)

# Predict df_test and calculate performance
nnetpred <- predict(nnet.mdl, newdata = test)
results <- data.frame(t(postResample(pred = nnetpred, obs = test$PH))) %>% 
    mutate(Model = "Neural Network (avNNet)") %>% rbind(results)

knitr::kable(results)
RMSE Rsquared MAE Model
0.9734992 0.5752473 0.7160752 Neural Network (avNNet)
1.1722589 0.3812287 0.9038890 ElasticNet Regression
1.1745032 0.3779246 0.9089871 Ridge Regression
1.1775326 0.3760906 0.9043881 Multiple Regression

KNN

cl6 <- makePSOCKcluster(5)
registerDoParallel(cl6)
set.seed(200)

knnMdl <- train(PH ~ ., data = train, method = "knn", preProc = c("center","scale"), tuneLength = 10)

stopCluster(cl6)

summary(knnMdl)
##             Length Class      Mode     
## learn        2     -none-     list     
## k            1     -none-     numeric  
## theDots      0     -none-     list     
## xNames      34     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    1     -none-     logical  
## param        0     -none-     list
plot(knnMdl)

# Predict df_test and calculate performance
knn <- predict(knnMdl, newdata = test)
results <- data.frame(t(postResample(pred = knn, obs = test$PH))) %>% 
    mutate(Model = "k-Nearest Neighbors (kNN)") %>% rbind(results)

knitr::kable(results)
RMSE Rsquared MAE Model
1.0598769 0.4952261 0.7785023 k-Nearest Neighbors (kNN)
0.9734992 0.5752473 0.7160752 Neural Network (avNNet)
1.1722589 0.3812287 0.9038890 ElasticNet Regression
1.1745032 0.3779246 0.9089871 Ridge Regression
1.1775326 0.3760906 0.9043881 Multiple Regression

GBM

library(caret)
train1 <- train_bc %>% dplyr::select(-PH)

X.train <- train1[set, ]
y.train <- train_bc$PH[set]
X.test <- train1[-set, ]
y.test <- train_bc$PH[-set]

cl7 <- makePSOCKcluster(5)
registerDoParallel(cl7)
set.seed(200)

grid <- expand.grid(n.trees = c(50, 100, 150, 200),
                    interaction.depth = c(1, 5, 10, 15),
                    shrinkage = c(0.01, 0.1, 0.5),
    n.minobsinnode = c(5, 10, 15))

gbm_Mdl <- train(x = X.train,
                   y = y.train,
                   method = "gbm",
                   tuneGrid = grid,
                   verbose = FALSE
)

stopCluster(cl7)
summary(gbm_Mdl)

##                                     var     rel.inf
## Mnf Flow                       Mnf Flow 17.50513869
## Oxygen Filler             Oxygen Filler  5.21618716
## `Brand Code`C             `Brand Code`C  5.15464656
## Usage cont                   Usage cont  4.82667539
## Carb Pressure1           Carb Pressure1  4.81167716
## Alch Rel                       Alch Rel  4.21118398
## Air Pressurer             Air Pressurer  4.20008387
## Pressure Vacuum         Pressure Vacuum  4.17488606
## Temperature                 Temperature  3.91465311
## Filler Speed               Filler Speed  3.57973207
## Balling Lvl                 Balling Lvl  3.46059110
## Carb Flow                     Carb Flow  3.39213631
## Density                         Density  3.29697641
## Carb Rel                       Carb Rel  3.16671830
## Fill Ounces                 Fill Ounces  2.22907630
## Balling                         Balling  2.19179651
## PC Volume                     PC Volume  2.14110063
## Carb Volume                 Carb Volume  1.95651762
## Hyd Pressure4             Hyd Pressure4  1.92470429
## Hyd Pressure2             Hyd Pressure2  1.87409118
## PSC                                 PSC  1.80071793
## Bowl Setpoint             Bowl Setpoint  1.73877297
## Fill Pressure             Fill Pressure  1.73621500
## Hyd Pressure3             Hyd Pressure3  1.73141879
## Carb Pressure             Carb Pressure  1.69627016
## PSC Fill                       PSC Fill  1.68639033
## Filler Level               Filler Level  1.67024046
## Carb Temp                     Carb Temp  1.54919810
## PSC CO2                         PSC CO2  1.15678322
## `Brand Code`B             `Brand Code`B  0.68534995
## Pressure Setpoint     Pressure Setpoint  0.64945532
## `Brand Code`A             `Brand Code`A  0.37706909
## `Brand Code`Unknown `Brand Code`Unknown  0.22264230
## `Brand Code`D             `Brand Code`D  0.07090368
plot(gbm_Mdl)

gbm_Mdl$bestTune
##    n.trees interaction.depth shrinkage n.minobsinnode
## 88     200                15       0.1              5
gbm_Mdl$finalModel
## A gradient boosted model with gaussian loss function.
## 200 iterations were performed.
## There were 34 predictors of which 34 had non-zero influence.
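The `bestTune` row above summarizes the winning combination; the resampling results behind it can also be ranked directly. A sketch, assuming the `gbm_Mdl` object fitted above:

```r
# Top tuning combinations by resampled RMSE (bestTune corresponds to the first row)
head(gbm_Mdl$results[order(gbm_Mdl$results$RMSE),
                     c("n.trees", "interaction.depth", "shrinkage",
                       "n.minobsinnode", "RMSE", "Rsquared")])
```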
# Predict df_test and calculate performance

gbm <- predict(gbm_Mdl, newdata = test)
results <- data.frame(t(postResample(pred = gbm, obs = test$PH))) %>% 
    mutate(Model = "Generalized Boosted Models") %>% rbind(results)

knitr::kable(results)
RMSE       Rsquared   MAE        Model
0.8928378  0.6405441  0.6664880  Generalized Boosted Models
1.0598769  0.4952261  0.7785023  k-Nearest Neighbors (kNN)
0.9734992  0.5752473  0.7160752  Neural Network (avNNet)
1.1722589  0.3812287  0.9038890  ElasticNet Regression
1.1745032  0.3779246  0.9089871  Ridge Regression
1.1775326  0.3760906  0.9043881  Multiple Regression
MARS

options(max.print = 1e+06)

cl8 <- makePSOCKcluster(5)
registerDoParallel(cl8)
set.seed(200)

mars.grid <- expand.grid(.degree = 1:2, .nprune = 2:15)

mars.model <- train(x = X.train, y = y.train, method = "earth", tuneGrid = mars.grid, 
    preProcess = c("center", "scale"), tuneLength = 10)

stopCluster(cl8)

summary(mars.model)
## Call: earth(x=data.frame[1799,34], y=c(34.44,33.61,3...), keepxy=TRUE,
##             degree=2, nprune=15)
## 
##                                                     coefficients
## (Intercept)                                            34.505258
## h(-0.187665-Mnf Flow)                                   2.268911
## h(1.55802-Temperature)                                  0.381931
## h(Balling-0.666758)                                     1.813780
## \\Brand Code\\C * h(Hyd Pressure3-0.713986)            -0.934977
## \\Brand Code\\C * h(0.713986-Hyd Pressure3)            -0.316747
## h(-0.187665-Mnf Flow) * h(1.52742-Carb Pressure1)      -0.285200
## h(-0.187665-Mnf Flow) * h(Pressure Vacuum-0.366229)    -0.550243
## h(-0.187665-Mnf Flow) * h(0.366229-Pressure Vacuum)    -0.540903
## h(-0.187665-Mnf Flow) * h(Air Pressurer-0.823362)      -0.505415
## h(Mnf Flow- -0.187665) * h(Air Pressurer-0.32098)       0.442730
## h(1.55802-Temperature) * h(Usage cont-0.231278)        -0.329434
## h(Density-1.06311) * h(Oxygen Filler- -1.05263)        -1.181989
## h(0.666758-Balling) * h(Bowl Setpoint- -1.34819)        0.304218
## h(Balling-0.666758) * h(1.62844-Alch Rel)              -1.023213
## 
## Selected 15 of 32 terms, and 13 of 34 predictors (nprune=15)
## Termination condition: RSq changed by less than 0.001 at 32 terms
## Importance: `Mnf Flow`, `\`Brand Code\`C`, `Hyd Pressure3`, Balling, ...
## Number of terms at each degree of interaction: 1 3 11
## GCV 1.104934    RSS 1909.016    GRSq 0.4836855    RSq 0.5035911
# Predict df_test and calculate performance
mars <- predict(mars.model, newdata = X.test)
results <- data.frame(t(postResample(pred = mars, obs = y.test))) %>% 
    mutate(Model = "Multivariate Adaptive Regression Splines (MARS)") %>% rbind(results)

knitr::kable(results)
RMSE       Rsquared   MAE        Model
1.1203560  0.4367615  0.8620190  Multivariate Adaptive Regression Splines (MARS)
0.8928378  0.6405441  0.6664880  Generalized Boosted Models
1.0598769  0.4952261  0.7785023  k-Nearest Neighbors (kNN)
0.9734992  0.5752473  0.7160752  Neural Network (avNNet)
1.1722589  0.3812287  0.9038890  ElasticNet Regression
1.1745032  0.3779246  0.9089871  Ridge Regression
1.1775326  0.3760906  0.9043881  Multiple Regression
Cubist

cl9 <- makePSOCKcluster(5)
registerDoParallel(cl9)
set.seed(200)

cubist_Model <- train(x = X.train, y = y.train, method = "cubist")

stopCluster(cl9)
# Predict df_test and calculate performance
Cubist <- predict(cubist_Model, newdata = X.test)
results <- data.frame(t(postResample(pred = Cubist, obs = y.test))) %>% mutate(Model = "Cubist Model") %>% rbind(results)

knitr::kable(results)
RMSE       Rsquared   MAE        Model
0.9116417  0.6287763  0.6811602  Cubist Model
1.1203560  0.4367615  0.8620190  Multivariate Adaptive Regression Splines (MARS)
0.8928378  0.6405441  0.6664880  Generalized Boosted Models
1.0598769  0.4952261  0.7785023  k-Nearest Neighbors (kNN)
0.9734992  0.5752473  0.7160752  Neural Network (avNNet)
1.1722589  0.3812287  0.9038890  ElasticNet Regression
1.1745032  0.3779246  0.9089871  Ridge Regression
1.1775326  0.3760906  0.9043881  Multiple Regression
Random Forest

set.seed(100)

library(randomForest)

rfModel <- randomForest(X.train, y.train,
                        importance = TRUE,
                        ntree = 1000)


rfPred <- predict(rfModel, X.test)


results <- data.frame(t(postResample(pred = rfPred, obs = y.test))) %>% 
    mutate(Model = "Random Forest") %>% rbind(results)

knitr::kable(results)
RMSE       Rsquared   MAE        Model
0.8498634  0.6907792  0.6238147  Random Forest
0.9116417  0.6287763  0.6811602  Cubist Model
1.1203560  0.4367615  0.8620190  Multivariate Adaptive Regression Splines (MARS)
0.8928378  0.6405441  0.6664880  Generalized Boosted Models
1.0598769  0.4952261  0.7785023  k-Nearest Neighbors (kNN)
0.9734992  0.5752473  0.7160752  Neural Network (avNNet)
1.1722589  0.3812287  0.9038890  ElasticNet Regression
1.1745032  0.3779246  0.9089871  Ridge Regression
1.1775326  0.3760906  0.9043881  Multiple Regression
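All candidate models are now scored on the same hold-out set, so the selection criterion can be made explicit by ranking the comparison table. A minimal sketch, assuming the `results` data frame accumulated above:

```r
library(dplyr)

# Rank candidates by hold-out RMSE (lower is better);
# Rsquared breaks ties toward better explained variance
results %>%
  arrange(RMSE, desc(Rsquared)) %>%
  knitr::kable(digits = 4)
```

Random Forest tops this ranking, with Cubist and GBM close behind, which motivates the importance and correlation checks that follow.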
rfImp <- varImp(rfModel, scale = TRUE) %>%
  as.data.frame()

rfImp %>%
  arrange(-Overall) %>%
  kable() %>% 
  kable_styling() %>%
  scroll_box()
Overall
Mnf Flow 55.4294548
Brand CodeC 48.5689924
Oxygen Filler 48.4671185
Pressure Vacuum 47.2839766
Temperature 39.0917386
Air Pressurer 38.5935464
Usage cont 36.8814501
Filler Speed 36.5237251
Carb Rel 35.8996462
Balling Lvl 33.9379320
Alch Rel 32.7328034
Carb Flow 32.2676163
Balling 31.1178814
Carb Pressure1 30.7906984
Density 27.5308937
Bowl Setpoint 25.1728767
Hyd Pressure3 23.1533251
Filler Level 23.1506157
Brand CodeB 21.3900442
Carb Volume 20.8766737
PC Volume 18.8554462
Hyd Pressure2 18.6143465
Fill Pressure 16.8877679
Hyd Pressure4 16.8676225
Brand CodeD 16.6718170
Brand CodeA 14.1870874
Pressure Setpoint 14.0282006
Brand CodeUnknown 13.0502763
Fill Ounces 5.7929613
Carb Pressure 4.4967456
Carb Temp 3.1029152
PSC Fill 1.7036334
PSC CO2 0.2873638
PSC -0.7007273
varImpPlot(rfModel, sort = TRUE, n.var = 10)

top10 <- varImp(rfModel) %>%
  filter(Overall < 57) %>%
  arrange(-Overall) %>%
  head(10)


cleantrain %>% 
  dplyr::select(c("PH", row.names(top10))) %>%
  cor() 
##                          PH     Mnf Flow `Brand Code`C Oxygen Filler
## PH               1.00000000 -0.446975100  -0.280418076   0.171402067
## Mnf Flow        -0.44697510  1.000000000  -0.004129172  -0.537695437
## `Brand Code`C   -0.28041808 -0.004129172   1.000000000   0.045039296
## Oxygen Filler    0.17140207 -0.537695437   0.045039296   1.000000000
## Pressure Vacuum  0.22053722 -0.527885581  -0.060622215   0.227311133
## Temperature     -0.14731541 -0.092651655   0.202061285   0.148297951
## Air Pressurer   -0.01365873 -0.049374969  -0.020411142   0.098706380
## Usage cont      -0.31699873  0.521310419   0.005411482  -0.310887324
## Filler Speed    -0.02157477  0.150580883  -0.003874677  -0.132903757
## Carb Rel         0.16169918 -0.030561658  -0.240218983  -0.007825912
## Balling Lvl      0.09996422  0.036036226  -0.223972463  -0.055573624
##                 Pressure Vacuum Temperature Air Pressurer   Usage cont
## PH                  0.220537222 -0.14731541   -0.01365873 -0.316998730
## Mnf Flow           -0.527885581 -0.09265165   -0.04937497  0.521310419
## `Brand Code`C      -0.060622215  0.20206128   -0.02041114  0.005411482
## Oxygen Filler       0.227311133  0.14829795    0.09870638 -0.310887324
## Pressure Vacuum     1.000000000  0.04683150    0.16817638 -0.303452067
## Temperature         0.046831501  1.00000000    0.05603934 -0.115017934
## Air Pressurer       0.168176380  0.05603934    1.00000000 -0.094720140
## Usage cont         -0.303452067 -0.11501793   -0.09472014  1.000000000
## Filler Speed        0.015564308 -0.37782777   -0.01588814  0.075729750
## Carb Rel           -0.003370213 -0.09416766   -0.10101035 -0.030865283
## Balling Lvl        -0.043948462 -0.17841637   -0.08642802  0.042250201
##                 Filler Speed     Carb Rel Balling Lvl
## PH              -0.021574773  0.161699183  0.09996422
## Mnf Flow         0.150580883 -0.030561658  0.03603623
## `Brand Code`C   -0.003874677 -0.240218983 -0.22397246
## Oxygen Filler   -0.132903757 -0.007825912 -0.05557362
## Pressure Vacuum  0.015564308 -0.003370213 -0.04394846
## Temperature     -0.377827773 -0.094167660 -0.17841637
## Air Pressurer   -0.015888136 -0.101010352 -0.08642802
## Usage cont       0.075729750 -0.030865283  0.04225020
## Filler Speed     1.000000000 -0.052265643 -0.01494369
## Carb Rel        -0.052265643  1.000000000  0.84348310
## Balling Lvl     -0.014943687  0.843483101  1.00000000
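The correlation matrix above is easier to scan as a heat map. A sketch using the corrplot package, assuming `cleantrain` and `top10` as defined above:

```r
library(corrplot)

# Heat map of PH against the top 10 random-forest predictors
cleantrain %>%
  dplyr::select(c("PH", row.names(top10))) %>%
  cor() %>%
  corrplot(method = "color", type = "upper", tl.cex = 0.7,
           title = "Correlation between PH and the Top 10 Predictors",
           mar = c(0, 0, 1, 0))
```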
stat
## # A tibble: 4 × 10
##   `Brand Code`   Min    Q1 Median  Mean    Q3   Max StdDev Count Missing
##   <chr>        <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <int>   <int>
## 1 A             8.06  8.42   8.52  8.50  8.6   8.86  0.163   293       0
## 2 B             8     8.44   8.56  8.57  8.7   8.94  0.169  1239       4
## 3 C             7.88  8.3    8.42  8.41  8.54  9.36  0.177   304       0
## 4 D             8.16  8.5    8.62  8.60  8.7   8.92  0.135   615       0
set.seed(100)
library(mice)
#install.packages("writexl")
library(writexl)


eval_data <- cleaneval %>% mutate(across(where(is.numeric), ~replace_na(., median(., na.rm=TRUE))))

ph_prediction <- predict(cubist_Model, eval_data)

eval_data$ph <- ph_prediction
head(eval_data)
##   `Brand Code`A `Brand Code`B `Brand Code`C `Brand Code`D `Brand Code`Unknown
## 1             0             0             0             1                   0
## 2             1             0             0             0                   0
## 3             0             1             0             0                   0
## 4             0             1             0             0                   0
## 5             0             1             0             0                   0
## 6             0             1             0             0                   0
##   Carb Volume Fill Ounces PC Volume Carb Pressure Carb Temp   PSC PSC Fill
## 1    5.480000    24.03333 0.2700000          65.4     134.6 0.236     0.40
## 2    5.393333    23.95333 0.2266667          63.2     135.0 0.042     0.22
## 3    5.293333    23.92000 0.3033333          66.4     140.4 0.068     0.10
## 4    5.266667    23.94000 0.1860000          64.8     139.0 0.004     0.20
## 5    5.406667    24.20000 0.1600000          69.4     142.2 0.040     0.30
## 6    5.286667    24.10667 0.2120000          73.4     147.2 0.078     0.22
##   PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure Hyd Pressure2 Hyd Pressure3
## 1    0.04     -100          116.6          46.0             0             0
## 2    0.08     -100          118.8          46.2             0             0
## 3    0.02     -100          120.2          45.8             0             0
## 4    0.02     -100          124.8          40.0             0             0
## 5    0.06     -100          115.0          51.4             0             0
## 6    0.04     -100          118.6          46.4             0             0
##   Hyd Pressure4 Filler Level Filler Speed Temperature Usage cont Carb Flow
## 1            96        129.4         3986        66.0      21.66      2950
## 2           112        120.0         4012        65.6      17.60      2916
## 3            98        119.4         4010        65.6      24.18      3056
## 4           132        120.2         3997        74.4      18.12        28
## 5            94        116.0         4018        66.4      21.32      3214
## 6            94        120.4         4010        66.6      18.00      3064
##   Density Balling Pressure Vacuum Oxygen Filler Bowl Setpoint Pressure Setpoint
## 1    0.88   1.398            -3.8         0.022           130              45.2
## 2    1.50   2.942            -4.4         0.030           120              46.0
## 3    0.90   1.448            -4.2         0.046           120              46.0
## 4    0.74   1.056            -4.0         0.052           120              46.0
## 5    0.88   1.398            -4.0         0.082           120              50.0
## 6    0.84   1.298            -3.8         0.064           120              46.0
##   Air Pressurer Alch Rel Carb Rel Balling Lvl       ph
## 1         142.6     6.56     5.34        1.48 38.62435
## 2         147.2     7.14     5.58        3.04 38.84820
## 3         146.6     6.52     5.34        1.46 38.61751
## 4         146.4     6.48     5.50        1.48 38.63840
## 5         145.8     6.50     5.38        1.46 38.61926
## 6         146.0     6.50     5.42        1.44 38.63348
# Note: these predictions are on the transformed PH scale the models were trained on
# (train_bc), not raw pH; back-transform before reporting pH values to stakeholders.
write_xlsx(eval_data, 'StudentEvaluation_final.xlsx')

hist(eval_data$ph)